Project 4 - Red Wine dataset - James Ward

This report explores a dataset containing expert quality scores and 11 physicochemical attributes (plus a row index, X) for 1,599 red wines.

Initial Data Exploration

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

It is tough to get a “feel” for the data from these numbers alone, but they are helpful in interpreting the plots below.

Univariate Plots Section

Quality appears to follow a rough normal distribution with a mean of 5.636. It also appears that fractional scores were not allowed, making the data highly discrete. No wine scored below a 3, nor above an 8.

The distribution of most variables appears to follow a rough bell curve, though many of them show a strong positive skew away from zero. The distribution of citric acid amounts is also interesting in that it appears to be slightly bimodal.

I’ll try increasing the number of bins and cutting off outliers to see if we can get any additional insight.

By increasing the number of bins from 30 to 100, we can see that the data is actually much noisier than the original plots show. You can also begin to see that many of the variables become discrete - likely we are seeing the limits of the precision of the measurement methods (i.e. the instruments only allowed them to get to the nearest whole number, the nearest tenth, nearest one hundredth, etc., or they rounded).
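The effect of bin width on rounded measurements can be reproduced with a small base R sketch. The data here are synthetic (normal draws rounded to one decimal place, an assumption standing in for an instrument that reports to the nearest tenth), not the wine data, and all variable names are illustrative.

```r
# Synthetic stand-in for a measurement recorded to the nearest 0.1
# (hypothetical data, not the wine dataset).
set.seed(42)
x <- round(rnorm(1599, mean = 8.3, sd = 1.7), 1)

lo <- floor(min(x))
hi <- ceiling(max(x))

# Coarse bins (width 1) hide the rounding; fine bins (width 0.05) expose it.
coarse_breaks <- lo:hi
fine_breaks <- (lo * 20):(hi * 20) / 20

coarse <- hist(x, breaks = coarse_breaks, plot = FALSE)
fine <- hist(x, breaks = fine_breaks, plot = FALSE)

# Share of empty bins at each resolution.
empty_coarse <- mean(coarse$counts == 0)
empty_fine <- mean(fine$counts == 0)
cat("Share of empty bins, coarse:", round(empty_coarse, 2),
    " fine:", round(empty_fine, 2), "\n")
```

Once the bins are narrower than the rounding step, roughly every other bin has to be empty, which is the comb-like pattern that shows up in the 100-bin plots.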

The shape of the citric acid distribution becomes even more interesting, with pronounced peaks near 0, 0.25, and 0.5. These are very round numbers - does this indicate a limitation in the measurement method?

I transformed the graphs using log base 10 along the x axis to get a better look at the long tails of the distributions (using 50 bins this time).
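To illustrate why a log base 10 axis helps with long right tails, here is a minimal base R sketch. It uses a synthetic log-normal sample as an assumed stand-in for a skewed variable such as residual sugar; the `skewness` helper is defined here for illustration and is not taken from the original analysis.

```r
# Synthetic right-skewed sample (hypothetical, log-normal), standing in for
# a long-tailed variable such as residual sugar or chlorides.
set.seed(1)
x <- rlnorm(1599, meanlog = log(2.2), sdlog = 0.5)

# Sample skewness: the third standardized moment.
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

# Exactly 50 equal-width bins on the log10 scale (51 break points).
lx <- log10(x)
log_hist <- hist(lx, breaks = seq(min(lx), max(lx), length.out = 51),
                 plot = FALSE)

cat("Skewness, raw scale:  ", round(skew_raw, 2), "\n")
cat("Skewness, log10 scale:", round(skew_log, 2), "\n")
```

A skewness near zero on the log scale means the long tail has been pulled in, so the bins spread more evenly across the bulk of the data.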

I wonder how strong the relationships are between these variables and the perceived quality of the wine. To get a better sense for this, let’s graph these variables for the best wines (quality of 7-8) vs. the sub-par wines (quality of 3-5).
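The split into quality bands can be sketched in base R as follows. To keep the example self-contained, it uses only the first ten rows of three columns, copied from the `str()` output above; the names `wine_head`, `best`, and `sub_par` are illustrative and not necessarily those used in the original analysis.

```r
# First ten rows of three columns, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Split by quality band; wines scoring exactly 6 fall in neither group.
best <- subset(wine_head, quality >= 7)
sub_par <- subset(wine_head, quality <= 5)

print(summary(best))
print(summary(sub_par))
```

Applied to the full data frame, the same two `subset()` calls would yield the groups summarized in the tables that follow.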

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   8.0   Min.   : 4.900   Min.   :0.1200   Min.   :0.0000  
##  1st Qu.: 482.0   1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000  
##  Median : 939.0   Median : 8.700   Median :0.3700   Median :0.4000  
##  Mean   : 831.7   Mean   : 8.847   Mean   :0.4055   Mean   :0.3765  
##  3rd Qu.:1089.0   3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900  
##  Max.   :1585.0   Max.   :15.600   Max.   :0.9150   Max.   :0.7600  
##  residual.sugar    chlorides       free.sulfur.dioxide
##  Min.   :1.200   Min.   :0.01200   Min.   : 3.00      
##  1st Qu.:2.000   1st Qu.:0.06200   1st Qu.: 6.00      
##  Median :2.300   Median :0.07300   Median :11.00      
##  Mean   :2.709   Mean   :0.07591   Mean   :13.98      
##  3rd Qu.:2.700   3rd Qu.:0.08500   3rd Qu.:18.00      
##  Max.   :8.900   Max.   :0.35800   Max.   :54.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9906   Min.   :2.880   Min.   :0.3900  
##  1st Qu.: 17.00       1st Qu.:0.9947   1st Qu.:3.200   1st Qu.:0.6500  
##  Median : 27.00       Median :0.9957   Median :3.270   Median :0.7400  
##  Mean   : 34.89       Mean   :0.9960   Mean   :3.289   Mean   :0.7435  
##  3rd Qu.: 43.00       3rd Qu.:0.9973   3rd Qu.:3.380   3rd Qu.:0.8200  
##  Max.   :289.00       Max.   :1.0032   Max.   :3.780   Max.   :1.3600  
##     alcohol         quality     
##  Min.   : 9.20   Min.   :7.000  
##  1st Qu.:10.80   1st Qu.:7.000  
##  Median :11.60   Median :7.000  
##  Mean   :11.52   Mean   :7.083  
##  3rd Qu.:12.20   3rd Qu.:7.000  
##  Max.   :14.00   Max.   :8.000

Some of the distributions and means for the higher quality wines are noticeably different. For instance, the higher quality wines have a mean citric acid level that is 39% higher than the overall average, with mean volatile acidity and total sulfur dioxide that are 23% and 25% lower than the overall average, respectively. Let’s look at the sub-par wines.

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1.0   Min.   : 4.600   Min.   :0.1800   Min.   :0.0000  
##  1st Qu.: 298.8   1st Qu.: 7.100   1st Qu.:0.4600   1st Qu.:0.0800  
##  Median : 718.5   Median : 7.800   Median :0.5900   Median :0.2200  
##  Mean   : 750.1   Mean   : 8.142   Mean   :0.5895   Mean   :0.2378  
##  3rd Qu.:1227.2   3rd Qu.: 8.900   3rd Qu.:0.6800   3rd Qu.:0.3600  
##  Max.   :1598.0   Max.   :15.900   Max.   :1.5800   Max.   :1.0000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.200   Min.   :0.03900   Min.   : 3.00      
##  1st Qu.: 1.900   1st Qu.:0.07400   1st Qu.: 8.00      
##  Median : 2.200   Median :0.08100   Median :14.00      
##  Mean   : 2.542   Mean   :0.09299   Mean   :16.57      
##  3rd Qu.: 2.600   3rd Qu.:0.09400   3rd Qu.:23.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :68.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9926   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 23.75       1st Qu.:0.9961   1st Qu.:3.200   1st Qu.:0.5200  
##  Median : 45.00       Median :0.9969   Median :3.310   Median :0.5800  
##  Mean   : 54.65       Mean   :0.9971   Mean   :3.312   Mean   :0.6185  
##  3rd Qu.: 78.00       3rd Qu.:0.9979   3rd Qu.:3.400   3rd Qu.:0.6500  
##  Max.   :155.00       Max.   :1.0031   Max.   :3.900   Max.   :2.0000  
##     alcohol          quality     
##  Min.   : 8.400   Min.   :3.000  
##  1st Qu.: 9.400   1st Qu.:5.000  
##  Median : 9.700   Median :5.000  
##  Mean   : 9.926   Mean   :4.902  
##  3rd Qu.:10.300   3rd Qu.:5.000  
##  Max.   :14.900   Max.   :5.000

Not surprisingly, the sub-par wines often vary from the averages in the opposite direction. Volatile acidity is 12% higher, citric acid is 12% lower, and total sulfur dioxide is 18% higher.
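The percentages quoted for both groups follow directly from the means in the three summary tables. A short worked check in base R, with the means typed in by hand from those tables (the data frame layout and the `pct_diff` helper are illustrative):

```r
# Means copied from the three summary() tables above.
means <- data.frame(
  variable = c("volatile.acidity", "citric.acid", "total.sulfur.dioxide"),
  overall = c(0.5278, 0.2710, 46.47),
  best = c(0.4055, 0.3765, 34.89),
  sub_par = c(0.5895, 0.2378, 54.65)
)

# Relative difference from the overall mean, in percent.
pct_diff <- function(group, overall) 100 * (group - overall) / overall

means$best_pct <- round(pct_diff(means$best, means$overall), 1)
means$sub_par_pct <- round(pct_diff(means$sub_par, means$overall), 1)

print(means[, c("variable", "best_pct", "sub_par_pct")])
```

Negative values mean the group average sits below the overall average. The signs flip between the two groups for all three variables.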

Univariate Analysis

What is the structure of your dataset?

There are 1,599 wines in the dataset with 11 physicochemical features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol), plus a quality score and a row index (X). All variables are numerical, rather than categorical. All are continuous, except for quality, which only exists in discrete whole numbers.

Other observations:

- the median quality is 6, while the mean is 5.636
- there are no quality scores below 3 or above 8
- density has a very narrow range, from 0.9901 to 1.0037
- the alcohol content ranges from 8.4 to 14.9 with a mean of 10.4

What is the main feature of your dataset?

Quality is the most interesting variable, because it determines the value of the wine to the drinker (supposedly). The other variables are primarily interesting in how they affect the quality.

What other features in the dataset do you think will help support your investigation into your feature of interest?

The variables that appear to impact quality the most are volatile acidity, citric acid, and total sulfur dioxide.

Did you create any new variables from existing variables in the dataset?

No, the variables appear to be fairly independent, so it did not make much sense to combine them or create any derived variables in this case. The only exception appears to be free sulfur dioxide and total sulfur dioxide, which are clearly related, but the relationship is simply subtractive in nature, so there is no value in creating a new variable for the difference between the two.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Quality is a highly discrete variable, but there is no way to change this after the data has already been collected. Citric acid also has a somewhat unusual distribution, but it doesn’t seem particularly problematic for our analysis.

Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000

Looking at this correlation table, quality is most strongly correlated with alcohol, volatile acidity, sulphates, citric acid, total sulfur dioxide, and density (in descending order of strength). Moving forward, I will focus on these variables.
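Picking out the strongest correlations by eye from a 13 by 13 table is error-prone, so here is a small base R sketch that ranks them. The values are copied from the quality column of the table above and rounded to three decimals; the row index X is left out, and the variable names `quality_cor`, `ranked`, and `top_six` are illustrative.

```r
# Correlations with quality, copied from the last column of the table above
# (rounded to three decimals; the row index X is left out).
quality_cor <- c(
  fixed.acidity = 0.124, volatile.acidity = -0.391, citric.acid = 0.226,
  residual.sugar = 0.014, chlorides = -0.129, free.sulfur.dioxide = -0.051,
  total.sulfur.dioxide = -0.185, density = -0.175, pH = -0.058,
  sulphates = 0.251, alcohol = 0.476
)

# Rank by strength, ignoring sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
top_six <- names(ranked)[1:6]
print(ranked)
```

With the full data frame loaded, the same vector could be obtained from the quality column of the `cor()` output rather than typed in by hand.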

Plotting a correlation matrix now.

Now plotting these variables vs. quality.

The discrete nature of the quality scores makes this a bit tougher to interpret, but the correlations can still be plainly seen in the plots.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality generally increases as volatile acidity decreases, which makes sense (acidity and the associated sourness are generally undesirable in wine). However, this same reasoning makes me surprised that higher citric acid actually increases the likely quality of the wine.

Sulfur dioxide and sulphates are used as preservatives in wine, and their relationships to quality do not seem particularly strong, so it probably makes economic sense to use them sparingly. Interestingly though, they have opposite relationships to quality. Perhaps this means that sulphates are a more desirable preservative, while sulfur dioxide is viewed as more of a contaminant. I wonder how they relate to one another, and if there is a cost difference or something in the creation process that would cause one to be more prevalent than the other.

Alcohol is positively correlated with quality, while density has a weaker negative correlation with quality. This makes intuitive sense because alcohol is less dense than water, so as the alcohol content increases, one would expect the density to decrease.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a strong negative correlation between volatile acidity and citric acid, which does not make intuitive sense to me without knowing more. There is also the negative correlation between alcohol and density, which I noted above.

What was the strongest relationship you found?

Quality is most highly correlated with alcohol content.
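To put the strength of this relationship in perspective: for a linear fit with a single predictor, the squared Pearson correlation gives the share of variance in quality that the predictor accounts for. A quick check in base R, using the two largest coefficients from the correlation table (the variable names are illustrative):

```r
# Pearson correlations with quality, copied from the correlation table.
r <- c(alcohol = 0.47616632, volatile.acidity = -0.39055778)

# Squared correlation = share of variance explained by a one-variable
# linear fit, expressed here in percent.
r_squared <- r^2
print(round(100 * r_squared, 1))
```

So even the strongest single variable accounts for well under a quarter of the variation in quality scores, which is worth keeping in mind when reading the plots.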

Multivariate Plots

This helps us visualize how likely it is that manipulating each variable would affect the quality of the wine. For instance, you can see that there are far fewer high quality wines with lower alcohol content.

Now let’s create a view that combines the two variables that are most highly correlated with quality.

This graph clearly shows a clump of lower-quality wines with low alcohol and high volatile acidity. It also shows that high quality wines generally have higher alcohol content and lower volatile acidity.

Alternately, we could switch the alcohol and volatile acidity variables to see if this plot provides any additional insights.

This plot reinforces the observations from the one above, though there are certainly some interesting outliers. All of the relationships observed so far also appear to be linear in nature, though the highly discrete nature of the quality scores makes it much more difficult to tell for certain. Therefore, transforming axes with other scales (log, square root, etc.) does not seem to be appropriate.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As noted above, the strongest relationships were those between quality and alcohol (positive correlation), and quality and volatile acidity (negative correlation). Because of the physical relationship between alcohol and density (alcohol is less dense than water), there is also a clear negative correlation between those two variables (about -0.50).

Were there any interesting or surprising interactions between features?

I was surprised to see that two different agents that are listed as wine preservatives (according to my quick research online) had opposite relationships with wine quality. That is, one had a positive correlation while the other had a negative correlation.

Final Plots and Summary

Plot 1

Description 1

Wine scores are only awarded in whole numbers, making the data highly discrete. The scores also only range from 3 to 8, with the vast majority receiving 5’s or 6’s.

Plot 2

Description 2

Histograms of the two variables most highly correlated with wine scores. You can see that alcohol is clearly skewed in the positive direction, while volatile acidity resembles more of a centered bell curve, though with some interesting gaps.

Plot 3

Description 3

This plot shows wine scores plotted against the two variables most highly correlated with them, alcohol and volatile acidity. Higher quality wines clearly tend to have higher alcohol content and lower volatile acidity, though there are certainly some visible outliers.

Reflection

The fact that higher acidity leads to lower wine scores is certainly not surprising, but I did not previously understand the different types of acidity (fixed, volatile, citric), and I will have to do more research to understand why only one measure of acidity has a strong impact on quality. I am also surprised that some of the other factors did not have a more significant impact (residual sugar, pH, chlorides, etc.).

I was also somewhat surprised to see the strongest correlation being that between quality and alcohol content. In fact, I might have originally guessed that higher alcohol levels would overpower the taste of the wine, resulting in lower scores. Clearly, this is not the case. This makes me wonder if there is an upper limit, past which the alcohol content would actually make the scores go down. For instance, there are fortified red wines and dessert red wines that can go beyond the alcohol levels in this data set, though they may be considered to be in a different class of wine entirely, so I’m not sure that the same scoring system could be applied anyway. Additional testing with data beyond the current ranges would help to understand the relationships more thoroughly.

It would also help to implement a more precise scoring system for future studies. The highly discrete nature of the scores makes it difficult to study the exact nature of the relationships and the shapes of the plots. Ideally future studies will use a scoring system that produces scores of a more continuous nature. Still, with the data we have, we were able to identify some very clear correlations between the few key variables discussed above.